A Closer Look at Skip-gram Modelling
Authors
Abstract
Data sparsity is a major problem in natural language processing: language is a system of rare events, so varied and complex that even with an extremely large corpus we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but tokens within them are allowed to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams they cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data, testing against similar documents as well as documents generated by a machine translation system. We also determine the amount of extra training data that standard adjacent tri-grams would require to match skip-gram coverage.
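To make the technique concrete, here is a minimal sketch in Python of k-skip-n-gram extraction and the coverage measurement the abstract describes. It assumes whitespace-tokenized text; the function names (skip_grams, ngrams, coverage) and the toy corpus are illustrative, not taken from the paper.

```python
# A minimal sketch of k-skip-n-gram extraction and coverage measurement.
# Names and the toy data below are illustrative assumptions, not the
# paper's actual implementation.
from itertools import combinations

def skip_grams(tokens, n=3, k=2):
    """Return the set of all n-grams with up to k skipped tokens in total.

    Each skip-gram is anchored at the first token of a window of
    n + k consecutive tokens; the remaining n - 1 tokens are chosen
    from the rest of the window in order.
    """
    grams = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start : start + n + k]
        # The first token is fixed; choose the remaining n-1 from the rest.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in rest))
    return grams

def ngrams(tokens, n=3):
    """Standard adjacent n-grams."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_tokens, test_tokens, n=3, k=2):
    """Fraction of the test text's adjacent n-grams that also occur
    among the training text's k-skip-n-grams."""
    model = skip_grams(train_tokens, n, k)
    test = ngrams(test_tokens, n)
    return sum(1 for g in test if g in model) / len(test) if test else 0.0

if __name__ == "__main__":
    train = "we can never accurately model all possible strings of words".split()
    test = "we can accurately model strings of words".split()
    print(f"tri-gram coverage with 2 skips: {coverage(train, test, n=3, k=2):.2f}")
```

Anchoring each window at its first token is sufficient because any n-gram with at most k total skips spans at most n + k consecutive positions starting from its first token; in the toy example the two skips let the training sentence cover every adjacent tri-gram of the test sentence.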
Related resources
I'm No Longer a Child: A Closer Look at the Interaction Between Iranian EFL University Students' Identities and Their Academic Performance
Although university EFL students represent a wide array of social and cultural identities, their multiple and diverse identities are not usually considered in foreign language classrooms. This qualitative case study examined the identity conflicts experienced by Iranian EFL learners in a university context. To this end, two Shiraz University students' identities were investigated. Sem...
A Multitask Objective to Inject Lexical Contrast into Distributional Semantics
Distributional semantic models have trouble distinguishing strongly contrasting words (such as antonyms) from highly compatible ones (such as synonyms), because both kinds tend to occur in similar contexts in corpora. We introduce the multitask Lexical Contrast Model (mLCM), an extension of the effective Skip-gram method that optimizes semantic vectors on the joint tasks of predicting corpus co...
Explaining and Generalizing Skip-Gram through Exponential Family Principal Component Analysis
The popular skip-gram model induces word embeddings by exploiting the signal from word-context co-occurrence. We offer a new interpretation of skip-gram based on exponential family PCA, a form of matrix factorization. This makes it clear that we can extend the skip-gram method to tensor factorization, in order to train embeddings through richer higher-order co-occurrences, e.g., triples that include...
word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA
Mikolov et al. (2013) introduced the skip-gram formulation for neural word embeddings, wherein one tries to predict the context of a given word. Their negative-sampling algorithm improved the computational feasibility of training the embeddings. Because of the embeddings' state-of-the-art performance on a number of tasks, there has been much research aimed at better understanding the method. Goldberg and Levy (2014)...
Breaking Sticks and Ambiguities with Adaptive Skip-gram
The recently proposed Skip-gram model is a powerful method for learning high-dimensional word representations that capture rich semantic relationships between words. However, Skip-gram, like most prior work on learning word representations, does not take word ambiguity into account and maintains only a single representation per word. Although a number of Skip-gram modifications were proposed ...
Publication date: 2006